ABSTRACT



In this paper we present new ideas for improving the original PageRank algorithm. The first idea is to evaluate the statistical reliability of the ranking score of each node based on local graph properties; the second is to introduce the notion of path diversity. The path diversity can be exploited to dynamically modify the increment value of each node in the random surfer model or to dynamically adapt the damping factor. We illustrate the impact of such modifications through examples and simple simulations.

Categories and Subject Descriptors 

G.2.2 [Discrete Mathematics]: Graph Theory — Graph 
algorithms; F.2.2 [Analysis of algorithms and prob- 
lem complexity]: Nonnumerical Algorithms and Prob- 
lems — Sorting and searching; H.3.3 [Information storage 
and retrieval]: Information Search and Retrieval — rele- 
vance feedback, search process 

General Terms 

Algorithms 

Keywords 

ranking, web graph, random walk, reliability, diversity. 

1. INTRODUCTION 

There has been an important research investment over at least 10 years on the PageRank algorithm and related topics (cf.
[221 [71 El EQl HZl [M gl [13J El El [23]), but there were few 
results concerning the modification of the original PageRank algorithm.

PageRank is an elegant solution for evaluating the importance of the nodes of a graph, based on the resolution of a fixed point problem associated to the random surfer model and to the Markov chain associated to the random walk. The PageRank algorithm can then be seen as a Perron-Frobenius problem (simplified formulation):

$$A \cdot X = X$$

where $A$ is the transition matrix associated to the random surfer model (the size of the state space is $N$, if there are $N$ URLs)

$$A(i,j) = \frac{1}{N_j}$$

(for a link from $j$ to $i$, $N_j$ being the number of outgoing links of $j$) and $X$ is the stationary probability. $X$ measures the relevancy of each URL (cf. [21] [15] [17]), which is proportional to the average sojourn time at each node during the random walk.

In this paper, we are interested in investigating one very 
specific issue which may be the Achilles' heel of PageRank. 
This issue is related to the possible impacts from the choice 
of the damping factor [TU H Q] [10] [9]. The role of the 
damping factor in the initial PageRank algorithm can be 
associated in the random surfer model to the probability 
that the surfer gets bored after several clicks and switches 
to a random page. More technically speaking, it may have 
three roles: 

• [Irreducibility] firstly, it plays a role of mixing all nodes 
and making the associated Markov chain irreducible 
(i.e. we have a single connected component); 

• [Indirect inheritance] secondly, it directly controls the way the importance weights are inherited when following the links (cf. illustration in Figure 1); as a consequence, it impacts the global ranking results (cf. [12] El);

• [Trap nodes] finally, it prevents the random walk from staying too long in a trap position; the trap position can be one node (loop) or a group of nodes from which the outbound links are all local; the damping factor makes it possible to leave such a position and explore the whole space.

Because of the second point above, we think that the damping factor could partially induce arbitrary ranking results, which may be undesirable. This is further illustrated in Figure 1: the node $i_2$ inherits most of its score from $i_1$. More precisely, if the score of $i_1$ is $C(i_1)$, $i_2$ inherits from $i_1$: $C(i_1)/2 \times (1-\epsilon)$, where $1-\epsilon$ is the damping factor and $C(i_1)$ is divided by the number of outgoing links from $i_1$. Therefore, when applying PageRank family approaches, this indirect influence depends directly on the value of the damping factor.

In the next sections, we present how we can correct or at least control such an impact based on the idea of the statistical reliability (Section 2) and on the idea of the path diversity (Section 3).

2. STATISTICAL RELIABILITY 

We consider the random walk model on a graph $\mathcal{G}$ of $N$ nodes where each transition from node $j$ to node $i$ is defined by $p(i,j)$. In particular, we focus on a homogeneous graph (one could extend the same approach to a heterogeneous graph) where $p(i,j)$ is defined by $N(j)$, the number of outgoing links from node $j$ (if there is a link from $j$ to $i$):

$$p(i,j) = \frac{\epsilon}{N} + \frac{1-\epsilon}{N(j)}$$

where $1-\epsilon$ is the commonly called damping factor [21].

Figure 1: $i_2$ inherits scores from $i_1$

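In this section, we assume we already solved the original PageRank equation

$$X(i) = \frac{\epsilon}{N} + (1-\epsilon) \sum_{j \to i} \frac{X(j)}{N(j)} \qquad (1)$$

(the sum running over the nodes $j$ having a link to $i$) to find the ranking of each node $i$ of the graph, $X(i)$ defining the importance weight (real value between 0 and 1) of the node $i$.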
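As an illustration, equation (1) can be solved by a simple fixed-point iteration. The sketch below is a minimal Python example under stated assumptions: the function name `pagerank`, the 4-node toy graph and the treatment of nodes without outgoing links (their weight is spread uniformly) are hypothetical choices made for this example and are not taken from the paper.

```python
def pagerank(out_links, eps=0.15, n_iter=200):
    """Iterate X(i) = eps/N + (1 - eps) * sum_{j -> i} X(j) / N(j)."""
    n = len(out_links)
    x = [1.0 / n] * n
    for _ in range(n_iter):
        new = [eps / n] * n
        for j, targets in enumerate(out_links):
            if targets:
                share = (1.0 - eps) * x[j] / len(targets)
                for i in targets:
                    new[i] += share
            else:
                # Assumption: a node without outgoing links spreads uniformly.
                for i in range(n):
                    new[i] += (1.0 - eps) * x[j] / n
        x = new
    return x

# Hypothetical toy graph: out_links[j] lists the nodes that j points to.
toy_graph = [[1, 2], [2], [0], [2]]
scores = pagerank(toy_graph)
```

In this toy graph, node 2 receives links from the three other nodes and obtains the highest score, while node 3, which has no incoming link, only keeps the uniform term $\epsilon/N$.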

Now, we introduce a method to evaluate the statistical reliability of $X(i)$ based on the distribution of the contributions of the local incoming links.

2.1 General expression 

We assume that the Markov chain associated to the random walk is described by the transition probability $P$ with $p(i,j) = p_{i,j}$ (probability to jump from $j$ to $i$) and its stationary probability $X = (x_1, ..., x_N)$.

Then we define the quantity $r(i,j)$ by:

$$r(i,j) = p(i,j) \times x_j / x_i \qquad (2)$$

By definition $\sum_{j=1}^{N} r(i,j) = 1$ and $r(i,j)$ can be simply interpreted as the contribution of $j$ to $x_i$.

We define the following quantity measuring, say, the statistical error on $x_i$:

$$E(i) = \sum_{j=1}^{N} (r(i,j))^{\alpha} \times \beta \qquad (3)$$

with $\alpha > 1$ and $\beta \in [0, 1]$ (for instance, $\beta = 0.5$ or 1 and $\alpha = 2$, which seem to be the most natural choices). And we define:

$$F(i) = 1 - E(i) \qquad (4)$$

The function $F(i)$ can be interpreted as what we call the statistical reliability measure of $x_i$: $F(i)$ is close to one when $x_i$ is obtained from an equal contribution of a large number of incoming links ($F(i) = 1 - \beta/n$, if equal contribution from $n$ links), whereas when the distribution of $r(i,j)$ is concentrated on a single node $j$, $F(i)$ becomes close to $1 - \beta$, which is its minimum value.
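The two limiting values quoted above can be checked numerically. The following Python sketch assumes that the reliability is obtained as $F(i) = 1 - E(i)$, which is the reading consistent with the limits $1 - \beta/n$ and $1 - \beta$; the function name `reliability` and the two small graphs are hypothetical examples.

```python
def reliability(p, x, alpha=2.0, beta=0.5):
    """Return F(i) = 1 - beta * sum_j r(i, j)**alpha, r(i, j) = p[i][j] * x[j] / x[i]."""
    n = len(x)
    f = []
    for i in range(n):
        r = [p[i][j] * x[j] / x[i] for j in range(n)]
        f.append(1.0 - beta * sum(rij ** alpha for rij in r))
    return f

# Hypothetical 3-node cycle 0 -> 1 -> 2 -> 0: each node has a single contributor.
# p[i][j] is the probability to jump from j to i.
p_cycle = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
f_cycle = reliability(p_cycle, [1 / 3] * 3)

# Hypothetical complete graph with self-loops: three equal contributors per node.
p_uniform = [[1 / 3] * 3 for _ in range(3)]
f_uniform = reliability(p_uniform, [1 / 3] * 3)
```

With $\alpha = 2$ and $\beta = 0.5$, the cycle gives the minimum value $1 - \beta = 0.5$ for every node, and the uniform case gives $1 - \beta/3$.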

Remark 1. In the computation of $F$, we can include or not the transition probability resulting from the damping factor; however, it seems more natural to exclude it, since this is an artefact introduced for the computation and is not part of human built links.

Remark 2. The function $F(i)$ can also be interpreted as an evaluation of the robustness of the score $x_i$, if for instance one local incoming link should be dropped.

2.2 Random walk based counters 

Here, we assume that we maintain a counter vector $C$ of size $N$ to count the number of visits of all nodes during the random walk. We define a counter matrix $R$ of size $N \times N$ and we increment the counter $R(i,j)$ by one when we jump from node $j$ to node $i$. If we call $C(i)$ the counter associated to the node $i$, the ratio $r(i,j) = R(i,j)/C(i)$ gives the contribution ratio of node $j$ to $C(i)$. Then, we can define the function $F(i)$ as above.

Based on the formula (2), one may adapt the computation of $r(i,j)$ when other strategies are used to solve the PageRank equation (1).
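The counters can be sketched with a short random walk simulation. This is a minimal Python example, not the authors' implementation: the function name, the toy graph and the seed are hypothetical, and, following Remark 1, we assume that jumps caused by a reinitialization are not recorded in $R$ (so the ratios of a node may sum to less than one).

```python
import random

def random_walk_counters(out_links, steps, eps=0.15, seed=0):
    """Simulate the random surfer; return visit counters C and jump counters R."""
    rng = random.Random(seed)
    n = len(out_links)
    c = [0] * n
    r = [[0] * n for _ in range(n)]   # r[i][j]: number of jumps from j to i
    current = rng.randrange(n)
    c[current] += 1
    for _ in range(steps):
        if out_links[current] and rng.random() >= eps:
            nxt = rng.choice(out_links[current])
            r[nxt][current] += 1      # only link-following jumps are recorded
        else:
            nxt = rng.randrange(n)    # reinitialization (damping or dead end)
        c[nxt] += 1
        current = nxt
    return c, r

toy_graph = [[1, 2], [2], [0], [2]]
C, R = random_walk_counters(toy_graph, steps=20000)
ratio = [[R[i][j] / C[i] if C[i] else 0.0 for j in range(4)] for i in range(4)]
```

Here `ratio[i][j]` plays the role of $r(i,j) = R(i,j)/C(i)$; node 3 has no incoming link in this toy graph, so its row stays at zero.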

2.3 Exploitation of the reliability function F

The function F can be exploited for several purposes: 

• for the visualization issue: when we need to show the $K$ most relevant nodes associated to a node $i$, we can select the $K$ nodes based on the $K$ highest ratios $r(i,j)$; this idea can be generalized taking into account all distant neighbour nodes by summing the products of the form $r(i,j_1) \times r(j_1,j_2) \times ... \times r(j_n,j)$ to a distant node $j$, considering all possible paths to $i$; in such a generalization, we could also take into account the damping factor by multiplying by $(1-\epsilon)^n$ depending on the length of the path $n$; for the nodes pointed directly or indirectly by $i$, we can obviously use $x_i$;

• the distribution of $r(i,j)$ may be interpreted as a statistical signature of the ranking of $i$ and can be used to qualify the ranking of node $i$; moreover, it can be used to modify the ranking value itself (cf. Section 4).
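The first use case above (showing the $K$ most relevant nodes for a node $i$) amounts to sorting the ratios $r(i,j)$. A minimal Python sketch, with a hypothetical helper name and made-up ratio values:

```python
def top_contributors(ratio_row, k):
    """Return the k node indices with the highest contribution ratio r(i, j)."""
    order = sorted(range(len(ratio_row)), key=lambda j: ratio_row[j], reverse=True)
    return order[:k]

# Hypothetical contribution ratios r(i, j) of five nodes j on a given node i.
r_i = [0.05, 0.40, 0.10, 0.30, 0.15]
best = top_contributors(r_i, 2)
```

For these values the two selected nodes are 1 and 3.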

3. PATH DIVERSITY 

Here we define the notion of the path diversity to differentiate the increment value (for the random walk based counters $C$, see Section 2.2). The main motivation of this approach is to avoid giving too much importance to terminal or trap positions without being forced to play with the damping factor, which may have other global effects (such as what we called indirect inheritance in Section 1).

We could use a more or less aggressive definition of the path diversity. Here we give three different formulations.

3.1 Path diversity PD1 

This is the mildest version: we keep a memory of the $L$ last visited nodes $LP = (n_1, ..., n_L)$, where $n_1$ is the most recently visited node (if the length of the path from the initial position or the last reinitialized node is less than $L$, we take the path length from this position for $L$), and define the path diversity $div(LP)$ by the equation:

$$div(LP) = \sum_{i=1}^{L} f(i) \times g(i) \qquad (5)$$

where $f$ and $g$ can be defined in two ways:



Power-law model: 

m = 

g(i) = 

Exponential model: 

f(i) 
</(0 



i ; -a 



E 



j>i,ni =rij 



j > 1,711 ~ n j 



(6) 
(7) 



(8) 
(9) 



where $\delta$ should be at most 0.5. The specific choice of $\delta = 0.5$ seems the most interesting candidate (if $L$ is very large and the path is a local loop on a same node, $g(i)$ would tend to zero).

3.2 Path diversity PD2 

Here, we assume that we keep memory of the full path from the last reinitialization time (due to terminal positions or application of the damping factor). If the last visited nodes are $LP = (n_1, ..., n_L)$ (where $n_1$ is the most recently visited node) and the current position is $n_0$, we define the path redundancy of depth $i$ as:

red(i) = g(i), if LP includes a path of length i with a first node equal to $n_0$ and the last node equal to
       = 0, otherwise

and the path diversity as: 

$$div(LP) = \max\left(\sum_{i=1}^{\infty} g(i) - \sum_{i=1}^{L} red(i),\ 0\right). \qquad (10)$$

The function $g$ can be defined in different ways; in particular, we can define two types of model:

Power-law model:

$$g(i) = \frac{1}{i^{\alpha}} \qquad (11)$$

Exponential model:

$$g(i) = \frac{1}{\gamma^{i}} \qquad (12)$$

The specific choice of $\gamma = 2$ seems to be an interesting natural candidate (so that the increment is equal to 1 for a full diversity, i.e. all nodes are different).
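The normalization behind the choice $\gamma = 2$ can be checked directly, assuming the exponential model reads $g(i) = 1/\gamma^{i}$ (our reading of equation (12)): when all visited nodes are different, every $red(i)$ vanishes and the diversity reduces to a geometric series.

```latex
div(LP) = \sum_{i=1}^{\infty} \frac{1}{\gamma^{i}} = \frac{1}{\gamma - 1},
\qquad \text{which equals } 1 \text{ for } \gamma = 2 .
```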

3.3 Path diversity PD3 

This one is probably the most aggressive version of diversity: it is the same as PD2, but with the following formulation:

div(LP) = 0, if there is a node $n_i$ equal to $n_0$,
        = 1, otherwise.

However, the impact of PD3 if we continue the random walk is not clear (we may have $div(LP)$ equal to zero, then equal to 1). Such a definition would be more relevant
associated with ideas of Section [ 



Remark 3. The common intuition of the formulas above is to define a function that decreases when the number of unique elements in $LP$ is small, with a higher impact when the duplicated node position is closer to the current position (PD1) or when the size of the duplicated jump is small (PD2).

Remark 4. The notion of path diversity is natural in the context of the random walk. If the PageRank equation is to be solved/computed differently, an adaptation of this approach may not be feasible and/or may introduce an additional computation cost.

Remark 5. In a practical solution, PD1 should have a minor impact on the global ranking, whereas PD3 will penalize the trap positions the most. With the usually applied value of the damping factor (i.e. 0.85), the depth of the graph traversal before reinitialization is small, hence the PD3 definition can make sense in most situations in a large

3.4 Other application of the path diversity

Another possible way to exploit the path diversity is to take the damping factor as a function of the path diversity, for instance with PD1, PD2 or PD3.

A simple concrete solution can be: set the damping factor to 0 (i.e. force a reinitialization) if the current position has already been visited in the past (since the last reinitialization time); in Section 5 we show some results of such a strategy, which we call PR+D.

4. RANKING MODIFICATION
4.1 Reliability based modification

We propose a new adaptation of PageRank replacing the initial ranking $X(i)$ as follows (PRxF):

$$X'(i) = F(i) \times X(i). \qquad (13)$$



One simple motivation for such a modification is to differentiate the case when the ranking of $i$ is mostly inherited from a very small number of significant neighbour nodes (say, a Dirac type distribution) from the case when the contributions from the neighbour nodes are spread over a large number of them (a more uniform distribution). From such information, one may decide (depending of course on the context) to credit more score (or importance) to the nodes that depend more uniformly on a large number of nodes, which could also be a sign of consensus on the ranking.

With such a modification, it makes sense to use a smaller damping factor than 0.85.

We further illustrate this in two simple examples below. 

4.1.1 Example case C1

We set $\alpha = 2$, $\beta = 0.5$ and $\epsilon = 0.15$. In Figure 2, if we assume $a$ has a reference score of 4 and $c$ a score of 3 (up to a constant multiplicative factor), $b$ inherits from $a$: $4 \times 0.85 = 3.4$, which is higher than the score of $c$.

Applying the reliability function $F$, we get for $a$: $4 \times (1 - \beta/4) = 3.5$, for $b$: $3.4 \times (1 - \beta) = 1.7$ and for $c$: $3 \times (1 - \beta/3) = 2.5$.
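The arithmetic of case C1 can be replayed in a few lines of Python. The helper `f_equal` is a hypothetical name; it encodes the value $F = 1 - \beta/n$ stated earlier for $n$ equal contributions with $\alpha = 2$, and the numbers of contributing links (4 for $a$, 1 for $b$, 3 for $c$) are read off the factors used in the text.

```python
beta, eps = 0.5, 0.15

def f_equal(n_links, beta):
    """Reliability for n equal contributions with alpha = 2: F = 1 - beta / n."""
    return 1 - beta / n_links

score_a, score_c = 4.0, 3.0
score_b = score_a * (1 - eps)            # b inherits from a: 4 x 0.85
adj_a = score_a * f_equal(4, beta)       # a: 4 x (1 - beta/4)
adj_b = score_b * f_equal(1, beta)       # b: 3.4 x (1 - beta)
adj_c = score_c * f_equal(3, beta)       # c: 3 x (1 - beta/3)
```

After the correction, $b$ (3.4 x 0.5 = 1.7) falls below $c$ (2.5), whereas it was ranked above $c$ before applying $F$.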

4.1.2 Example case C2

We set $\alpha = 2$, $\beta = 1$ and $\epsilon = 0.15$. In Figure 3, from the initial PageRank, $b$ inherits from $a$ (assuming a score of 6 for $a$): $6 \times 0.85 = 5.1$. Then within $b, c, d$, the average sojourn time before the reinitialization is $1/0.15$, to be shared between the 3 nodes. Therefore, we have for $(a, b, c, d)$: $(6, 7.3, 2.2, 2.2)$.

Figure 2: b inherits scores from a

Figure 3: Case C2

Applying the reliability function $F$, we get for $(a, b, c, d)$: $(5, 3.4, 1.1, 1.1)$.

Now with $\epsilon = 0.05$: with the initial PageRank we obtain $(6, 12.4, 6.7, 6.7)$ and applying the reliability function we have $(5, 8.0, 3.3, 3.3)$. This scenario also shows the necessity of keeping $\epsilon$ not too small to avoid the deadlock position in $b, c, d$, which tends to overestimate their importance.

The modification proposed in the next section should allow one to take $\epsilon$ close to zero without putting too large an importance weight on the nodes $b, c, d$.
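The PageRank figures of case C2 follow from the back-of-the-envelope reasoning given in the text: $b$ receives $6 \times (1-\epsilon)$ from $a$, and the sojourn time $1/\epsilon$ is shared equally between $b$, $c$ and $d$. The short Python sketch below reproduces this reasoning only; the function name is hypothetical and it is not a full PageRank computation.

```python
def case_c2_scores(eps, score_a=6.0):
    """Approximate scores (a, b, c, d) following the reasoning of case C2."""
    inherited = score_a * (1 - eps)      # what b inherits from a
    shared = (1 / eps) / 3               # sojourn time 1/eps shared by b, c, d
    return (score_a, inherited + shared, shared, shared)

scores_015 = case_c2_scores(0.15)
scores_005 = case_c2_scores(0.05)
rounded_015 = [round(v, 1) for v in scores_015]
rounded_005 = [round(v, 1) for v in scores_005]
```

The rounded values are (6, 7.3, 2.2, 2.2) for $\epsilon = 0.15$ and (6, 12.4, 6.7, 6.7) for $\epsilon = 0.05$, which shows how quickly the trap nodes gain weight when $\epsilon$ decreases.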

4.2 Path diversity based modification 

Here we illustrate the impact of the introduction of the 
path diversity in the original PageRank equation. 

4.2.1 Example case C1

In this simple case, all visited nodes (before a reinitializa- 
tion is required when reaching nodes b or c) are different. 
Therefore, the path diversity div(LP) should be constant 
and does not impact the ranking. 

4.2.2 Example case C2 

In this scenario, when $\epsilon$ is close to 0, $b, c, d$ become a trap position and their importance weights should asymptotically sum up to 1. With PD1, the importance weight of $a$ tends to zero as well (not aggressive enough). Introducing the path diversity with PD2 or PD3, even if $\epsilon$ is equal to 0, the increment values for $b, c, d$ quickly tend to zero, guaranteeing a strictly positive weight of $a$ (and of other nodes). With PR+D, the importance weights of $b, c, d$ are the most penalized.
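The PR+D strategy of Section 3.4 can be sketched as a random walk that is reinitialized as soon as it would revisit a node. This is only our interpretation: we assume that a jump landing on a node already visited since the last reinitialization is replaced by a reinitialization (so the revisit itself is not counted), a detail the text does not specify. The function name, the toy graph and the seed are hypothetical.

```python
import random

def pr_plus_d(out_links, steps, eps=0.15, seed=0):
    """Random-walk visit frequencies with reinitialization on any revisit."""
    rng = random.Random(seed)
    n = len(out_links)
    counts = [0] * n
    current = rng.randrange(n)
    visited = {current}                  # nodes seen since last reinitialization
    counts[current] += 1
    for _ in range(steps):
        nxt = None
        if out_links[current] and rng.random() >= eps:
            candidate = rng.choice(out_links[current])
            if candidate not in visited:
                nxt = candidate
        if nxt is None:                  # damping, dead end, or revisit
            nxt = rng.randrange(n)
            visited = set()
        visited.add(nxt)
        counts[nxt] += 1
        current = nxt
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical trap: node 0 points to node 1, which only points to itself.
trap_graph = [[1], [1], [0], [0]]
freq = pr_plus_d(trap_graph, steps=50000)
```

In this toy graph the self-pointing node 1 no longer accumulates visits through its loop: it only keeps what it inherits from node 0 and from reinitializations.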

5. SIMULATION RESULTS 

Here we set up a simple simulation scenario to obtain a first evaluation of our proposed solution and a comparison with the original PageRank approach on the web graph. We do not claim to generate a realistic model; for more details on
the web graph the readers may refer to [TT1 [TBI [5l l6l [T8] . 

5.1 Scenario 

We set $N$, the total number of nodes (URLs) to be simulated. Then we create $L$ random (directional) links connecting a node $i$ to a node $j$ as follows:

• the choice of the source node is done following a uniform sampling in Scenario S1 or following a power-law $1/k^{\alpha}$ in Scenario S2;

• the choice of the destination node is done following a power-law $1/k^{\alpha}$.
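The link generation can be sketched as follows in Python. This is a hedged illustration rather than the simulator used for the figures: the function name and the seed are hypothetical, the parameter `a` stands for the power-law exponent, duplicate links are simply dropped, and for the power-law source case the randomization of the source probabilities is simplified to a full shuffle.

```python
import random

def generate_links(n, n_draws, a=1.5, power_law_source=False, seed=0):
    """Draw n_draws directed links (source, destination); duplicates are ignored."""
    rng = random.Random(seed)
    weights = [1.0 / (k ** a) for k in range(1, n + 1)]   # proportional to 1/k^a
    nodes = list(range(n))
    dests = rng.choices(nodes, weights=weights, k=n_draws)
    if power_law_source:
        src_weights = weights[:]
        rng.shuffle(src_weights)      # simplification: randomized source popularity
        sources = rng.choices(nodes, weights=src_weights, k=n_draws)
    else:
        sources = [rng.randrange(n) for _ in range(n_draws)]
    return set(zip(sources, dests))

links_s1 = generate_links(1000, 100_000)
links_s2 = generate_links(1000, 100_000, power_law_source=True)
in_degree_first = sum(1 for _, d in links_s1 if d == 0)
in_degree_last = sum(1 for _, d in links_s1 if d == 999)
```

Because repeated draws of an existing link are ignored, fewer distinct links than draws are created, and the effect is stronger when the sources are also drawn from a power-law.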
Figure 4: Scenario S1: number of incoming and outgoing links: 27880 links created.

For simplicity, we assumed no correlation between the number of incoming and outgoing links: in Scenario S1, a uniform sampling does not introduce correlation. In Scenario S2, we first associate to node $k$ a probability proportional to $1/k^{\alpha}$, followed by a large number (by default $N$) of permutations of randomly chosen pairs of nodes; the final results define the randomized probability to be chosen as a source node. In both scenarios, we order the $N$ nodes by their popularity (probability to be chosen as destination node), associating a probability proportional to $1/k^{\alpha}$ to the node position $k$: in the following, we call this the native order, which is very close to the ordering by the number of incoming links (and not equal because of the random realization).

When the link already exists between the source and destination nodes, we do not modify anything (that is why fewer than $L$ links are created). The resulting numbers of incoming and outgoing links are shown in Figures 4 and 5 ($N = 1000$, $L = 100 \times N$ and $\alpha = 1.5$).
Figure 5: Scenario S2: number of incoming (same as S1) and outgoing links ($N$ random permutations): 9533 links created.

Figure 6 shows the power-law on the number of incoming links (by construction) and on the number of nodes with $k$ incoming links (as a consequence).
Figure 6: Scenario S1/S2: number of incoming links (w.r.t. node position) and number of nodes with $k$ incoming links (w.r.t. $k$) in logscale. $\alpha = 1.5$.

5.2 Analysis 

5.2.1 Scenario S1

Figure 7 shows the results of the PageRank based ranking relevancy scores of the $N$ nodes: in this case, the PageRank ranking closely follows the ordering based on the number of incoming links, and the application of the function $F$ barely modifies the results (we can only notice a slightly smoother curve): because the choice of the source nodes is made randomly, the differentiation of the $N$ nodes is only based on their different probabilities to be chosen as destination node. So we can consider here the ranking based on the number of incoming links as the theoretically optimal one.

Figure 7: Scenario S1: PR and PRxF.

To evaluate the difference of the ranking scores of two ranking approaches R1 and R2 (associated to their respective normalized relevancy scores $X_1(\cdot)$ and $X_2(\cdot)$), we define the average deviation by:

$$dev(R1, R2) = \frac{1}{N} \sum_{i=1}^{N} \left| \sum_{k=1}^{i} \left( X_1(k) - X_2(k) \right) \right|$$

where $\sum_{k=1}^{i} X_1(k)$ gives the importance score of the $i$ first nodes (following the native order).
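The average deviation compares the cumulative scores of the two rankings along the native order. A minimal Python sketch, assuming the definition is the mean over $i$ of the absolute difference between the two cumulative sums up to node $i$; the function name and the two score vectors are hypothetical.

```python
from itertools import accumulate

def average_deviation(x1, x2):
    """dev(R1, R2): mean absolute gap between the cumulative scores of x1 and x2."""
    n = len(x1)
    cum1 = list(accumulate(x1))
    cum2 = list(accumulate(x2))
    return sum(abs(a - b) for a, b in zip(cum1, cum2)) / n

# Hypothetical normalized scores of four nodes listed in the native order.
x_a = [0.4, 0.3, 0.2, 0.1]
x_b = [0.5, 0.2, 0.2, 0.1]
dev_ab = average_deviation(x_a, x_b)
```

In this example the cumulative scores only differ at the first position (0.4 against 0.5), so the deviation is 0.1/4 = 0.025.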
Figure 8: Scenario S1: CDF of #incoming, PR, PRxF, PR+D and PRxF+D.

In Table 1 we show the average deviation (w.r.t. the number of incoming links) for the four approaches. The application of $F$ makes the scores much closer to the ordering scores based on the number of incoming links, which is expected since the factor $F$ tends to favour the nodes receiving more incoming links. The application of the diversity (PR+D, cf. Section 3.4) mainly reduces here the score of the best ranked nodes, and this explains its higher average deviation compared to PR.

To better highlight the difference of the ranking scores of two approaches, a node level deviation measure is defined as follows: given two ranking approaches R1 and R2, we first evaluate node per node the ratio of its relevancy score to the average value by $Y(i) = X(i) \times N / \sum_{k} X(k)$ for R1 and R2; then we measure the deviation between R1 and R2 by $dev(i) = Y_2(i)/Y_1(i)$. The results are shown in Figure 9: we compared PR and PRxF to the ranking score based on the number of incoming links. We observe more clearly that the deviation is much reduced when the function $F$ is applied: this node level deviation evaluation allows one to easily observe the differences at different ranking scales. We also see that the deviation is naturally higher when there is more noise (when the number of incoming links decreases).

        PR      PRxF    PR+D    (PR+D)xF
1.5     0.062   0.0055  0.084   0.0077
2.0     0.071   0.0082  0.12    0.0047
2.5     0.073   0.0028  0.14    0.0017

Table 1: Average deviation.
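The node level deviation can be computed directly from two score vectors. The Python sketch below uses a hypothetical function name and made-up scores; the first vector plays the role of the reference (for instance the number of incoming links) and does not need to be normalized beforehand.

```python
def node_level_deviation(x1, x2):
    """Return dev(i) = Y2(i) / Y1(i), with Y(i) = X(i) * N / sum_k X(k)."""
    n = len(x1)
    y1 = [v * n / sum(x1) for v in x1]
    y2 = [v * n / sum(x2) for v in x2]
    return [b / a for a, b in zip(y1, y2)]

# Hypothetical reference scores (e.g. incoming link counts) and ranking scores.
x_ref = [4, 3, 2, 1]
x_rank = [0.5, 0.2, 0.2, 0.1]
dev_nodes = node_level_deviation(x_ref, x_rank)
```

A value of 1 means that the node has the same relative weight in both rankings; here the first node is over-weighted (1.25) and the second one under-weighted (about 0.67).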
Figure 9: Scenario S1: deviation measure. (PR/#incoming) and (PRxF/#incoming).

Figure 10: Scenario S2: PR and PRxF.



5.2.2 Scenario S2 

Figure 10 shows the results of the PageRank based ranking relevancy scores of the $N$ nodes: in this case, there are 5 visible nodes after position 100 having a very good relevancy score with PR. Because of the relevancy inheritance of PR, even if they have few incoming links, their scores are very high when they are pointed by the best ranked nodes distributing few outgoing links. We can see that with the application of $F$ this effect disappears.

The comparison of PR and PRxF for the deviation measure is shown in Figure 11. We clearly see the big deviations for the 5 nodes we mentioned above.

Figure 11: Scenario S2: deviation measure. (PR/#incoming) and (PRxF/#incoming).

5.2.3 Scenario S2b

The scenario S2b is the same as S2, except that for the node 1 we imposed one outgoing link to the node 100, and the node 100 also has a unique outgoing link pointing to itself. This scenario is meant to illustrate the impact of the trap position and how we can control this impact.

Figure 12 shows the results of PR and PRxF. Both results show a very high relevancy score of the node 100. In fact, the relevancy score of the node 100 (with PR) mainly comes from the inheritance from the node 1, which can be estimated by $PR(1) \times 0.85/0.15 = 0.08785 \times 0.85/0.15 = 0.4978$ (which is very close to $x_{100} = 0.4993$). Node 100 has 17 incoming links, but the main contribution is from the node 1 and as a consequence it has a small reliability score of 0.25. This reduced score decreases the importance of the node 100 (from 0.5 to 0.13), but cannot control the effect of the self-pointing influence.
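The estimate $PR(1) \times 0.85/0.15$ can be recovered from the PageRank equation under simplifying assumptions (ours): we neglect the $\epsilon/N$ term and the other incoming links of node 100, and we use the fact that node 1 sends its whole propagated score to node 100, which also points to itself.

```latex
x_{100} \approx (1-\epsilon)\, x_{1} + (1-\epsilon)\, x_{100}
\;\Longrightarrow\;
x_{100} \approx \frac{1-\epsilon}{\epsilon}\, x_{1}
= \frac{0.85}{0.15} \times 0.08785 \approx 0.4978 .
```

The factor $1/\epsilon$ is the amplification caused by the self-loop, which is why the reliability function alone cannot remove it.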
Figure 12: Scenario S2b: PR and PRxF.

In Figure 13 we show the results of PR+D: here we see that with PR+D, we suppressed the self-pointing influence, but the node 100 still inherits from the first node a score of $0.15 \times 0.85 = 0.13$ (the score of the node 1 is of course modified by modifying the damping factor value dynamically).

Now applying the function $F$ on PR+D, we see that the node 100 is no longer differentiated.
Figure 13: Scenario S2b: PR+D and (PR+D)xF.

Here we showed a rather extreme case, with a maximum penalty with $F$ ($\beta = 1$) and with a maximum penalty from a non-diversity, to show how much it can impact and, more importantly, to illustrate the fact that we can keep control of what we called indirect inheritance and of the impact of trap nodes thanks to our modifications. In a practical solution, it is necessary to correctly tune these values.

6. CONCLUSION 

In this paper, we defined the statistical reliability function associated to each node of the graph and showed how it can be applied to possibly improve the results of the initial PageRank algorithm. We also discussed the benefit of introducing the notion of the path diversity to modify the increment value during the random walk or to modify the damping factor. We showed the possible consequences through simple simulation scenarios.

In future work, we expect to test/validate those ideas through a real data based evaluation.